{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0069fced",
   "metadata": {},
   "source": [
    "首先加载将要使用的软件包。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2a05d0f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "# data visualization\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib import style"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9e8a7e1c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "../dataset/600031.csv\n"
     ]
    }
   ],
   "source": [
    "# Display the folders and files in current directory;\n",
    "import os\n",
    "for dirname, _, filenames in os.walk('../dataset/'):\n",
    "    for filename in filenames:\n",
    "        print(os.path.join(dirname, filename))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f774c46e",
   "metadata": {},
   "source": [
    "add Markdown\n",
    "调用pandas的函数 read_csv() 读取 csv 文件内容到DataFrame df_train。\n",
    "\n",
    "读取数据以后，检查 df_train 的最初几行数据，同时可以看到每一列特征值。有些特征是是数值型的，有一些是字符型的。其中'Survived'是旅客的幸存标志。这一幸存标志在以后的例子中会作为训练数据的标签(the targets for training data)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4be03bd5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>日期</th>\n",
       "      <th>开盘</th>\n",
       "      <th>收盘</th>\n",
       "      <th>最高</th>\n",
       "      <th>最低</th>\n",
       "      <th>成交量</th>\n",
       "      <th>成交额</th>\n",
       "      <th>振幅</th>\n",
       "      <th>涨跌幅</th>\n",
       "      <th>涨跌额</th>\n",
       "      <th>换手率</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2020-11-09</td>\n",
       "      <td>27.80</td>\n",
       "      <td>27.54</td>\n",
       "      <td>28.14</td>\n",
       "      <td>26.89</td>\n",
       "      <td>856249</td>\n",
       "      <td>2348800992</td>\n",
       "      <td>4.55</td>\n",
       "      <td>0.18</td>\n",
       "      <td>0.05</td>\n",
       "      <td>1.01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2020-11-10</td>\n",
       "      <td>27.98</td>\n",
       "      <td>27.32</td>\n",
       "      <td>27.98</td>\n",
       "      <td>26.91</td>\n",
       "      <td>558824</td>\n",
       "      <td>1523209504</td>\n",
       "      <td>3.89</td>\n",
       "      <td>-0.80</td>\n",
       "      <td>-0.22</td>\n",
       "      <td>0.66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2020-11-11</td>\n",
       "      <td>27.33</td>\n",
       "      <td>28.31</td>\n",
       "      <td>28.86</td>\n",
       "      <td>27.21</td>\n",
       "      <td>1015374</td>\n",
       "      <td>2873205488</td>\n",
       "      <td>6.04</td>\n",
       "      <td>3.62</td>\n",
       "      <td>0.99</td>\n",
       "      <td>1.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2020-11-12</td>\n",
       "      <td>28.31</td>\n",
       "      <td>28.70</td>\n",
       "      <td>28.88</td>\n",
       "      <td>27.91</td>\n",
       "      <td>718595</td>\n",
       "      <td>2041558432</td>\n",
       "      <td>3.43</td>\n",
       "      <td>1.38</td>\n",
       "      <td>0.39</td>\n",
       "      <td>0.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2020-11-13</td>\n",
       "      <td>28.28</td>\n",
       "      <td>28.16</td>\n",
       "      <td>28.50</td>\n",
       "      <td>27.60</td>\n",
       "      <td>674807</td>\n",
       "      <td>1888773520</td>\n",
       "      <td>3.14</td>\n",
       "      <td>-1.88</td>\n",
       "      <td>-0.54</td>\n",
       "      <td>0.80</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           日期     开盘     收盘     最高     最低      成交量         成交额    振幅   涨跌幅  \\\n",
       "0  2020-11-09  27.80  27.54  28.14  26.89   856249  2348800992  4.55  0.18   \n",
       "1  2020-11-10  27.98  27.32  27.98  26.91   558824  1523209504  3.89 -0.80   \n",
       "2  2020-11-11  27.33  28.31  28.86  27.21  1015374  2873205488  6.04  3.62   \n",
       "3  2020-11-12  28.31  28.70  28.88  27.91   718595  2041558432  3.43  1.38   \n",
       "4  2020-11-13  28.28  28.16  28.50  27.60   674807  1888773520  3.14 -1.88   \n",
       "\n",
       "    涨跌额   换手率  \n",
       "0  0.05  1.01  \n",
       "1 -0.22  0.66  \n",
       "2  0.99  1.20  \n",
       "3  0.39  0.85  \n",
       "4 -0.54  0.80  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load data\n",
    "df_train = pd.read_csv('../dataset/600031.csv')\n",
    "# Show first lines of data\n",
    "df_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76e2c40c",
   "metadata": {},
   "source": [
    "# 数据检查"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f84c04fe",
   "metadata": {},
   "source": [
    "df_train.shape 显示数据的行数和列数；df_info() 显示数据的基本信息，包括每一列特征值的名称，行数，数值类型，和占用的内存空间。\n",
    "\n",
    "df_trian.desvribe()显示数据每一列的统计信息：平均值，方差，最大值，最小值，分位数，等。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "975cf5e1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(252, 11)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "15f3fdc8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 252 entries, 0 to 251\n",
      "Data columns (total 11 columns):\n",
      " #   Column  Non-Null Count  Dtype  \n",
      "---  ------  --------------  -----  \n",
      " 0   日期      252 non-null    object \n",
      " 1   开盘      252 non-null    float64\n",
      " 2   收盘      252 non-null    float64\n",
      " 3   最高      252 non-null    float64\n",
      " 4   最低      252 non-null    float64\n",
      " 5   成交量     252 non-null    int64  \n",
      " 6   成交额     252 non-null    int64  \n",
      " 7   振幅      252 non-null    float64\n",
      " 8   涨跌幅     252 non-null    float64\n",
      " 9   涨跌额     252 non-null    float64\n",
      " 10  换手率     252 non-null    float64\n",
      "dtypes: float64(8), int64(2), object(1)\n",
      "memory usage: 21.8+ KB\n"
     ]
    }
   ],
   "source": [
    "df_train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b85f8e76",
   "metadata": {},
   "source": [
    "# **缺失数据的处理**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d87cd4e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to check the missing percent of a DatFrame;\n",
    "def check_missing_data(df):\n",
    "    total = df.isnull().sum().sort_values(ascending = False)\n",
    "    percent = round(df.isnull().sum().sort_values(ascending = False) * 100 /len(df),2)\n",
    "    return pd.concat([total, percent], axis=1, keys=['Total','Percent'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2e8eca89",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Total</th>\n",
       "      <th>Percent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>日期</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>开盘</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>收盘</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>最高</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>最低</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>成交量</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>成交额</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>振幅</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>涨跌幅</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>涨跌额</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>换手率</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Total  Percent\n",
       "日期       0      0.0\n",
       "开盘       0      0.0\n",
       "收盘       0      0.0\n",
       "最高       0      0.0\n",
       "最低       0      0.0\n",
       "成交量      0      0.0\n",
       "成交额      0      0.0\n",
       "振幅       0      0.0\n",
       "涨跌幅      0      0.0\n",
       "涨跌额      0      0.0\n",
       "换手率      0      0.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "check_missing_data(df_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c41332a7",
   "metadata": {},
   "source": [
    "一个简单的处理缺失数据的方式，就是用0和1填补 'something'的缺失数据。因为something本身的字符没有意义，所以在 something缺失的位置填0，并且用1替代something的原有非空字符。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e9460e",
   "metadata": {},
   "source": [
    "此外：\n",
    "* 还可以用对应列的平均值填补缺失的数据。\n",
    "\n",
    "    `#complete missing age with median\n",
    "    data['Age'].fillna(data['Age'].mean(), inplace = True)`\n",
    "\n",
    "* 用相应列中出现最多的字符来代替缺失的值，这通过 dataframe 的函数 data['Embarked'].mode() 来实现。\n",
    "\n",
    "    `#complete missing Embarked with Mode\n",
    "    data['Embarked'].fillna(data['Embarked'].mode()[0], inplace = True)`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02dfcaaf",
   "metadata": {},
   "source": [
    "# **数据清理**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a8ee65f",
   "metadata": {},
   "source": [
    "* 检查冗余信息：相关性是检查冗余信息的有效方法。当某两列的相关系数非常高（接近1.0）的时候，我们认为这两列的信息是冗余的。以下的相关系数矩阵显示这个例子中没有冗余信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "61e8c58c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>开盘</th>\n",
       "      <th>收盘</th>\n",
       "      <th>最高</th>\n",
       "      <th>最低</th>\n",
       "      <th>成交量</th>\n",
       "      <th>成交额</th>\n",
       "      <th>振幅</th>\n",
       "      <th>涨跌幅</th>\n",
       "      <th>涨跌额</th>\n",
       "      <th>换手率</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>开盘</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.987127</td>\n",
       "      <td>0.994168</td>\n",
       "      <td>0.995600</td>\n",
       "      <td>-0.062605</td>\n",
       "      <td>0.309691</td>\n",
       "      <td>0.466374</td>\n",
       "      <td>-0.025143</td>\n",
       "      <td>-0.041680</td>\n",
       "      <td>-0.062120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>收盘</th>\n",
       "      <td>0.987127</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.995569</td>\n",
       "      <td>0.994422</td>\n",
       "      <td>-0.053926</td>\n",
       "      <td>0.316265</td>\n",
       "      <td>0.500237</td>\n",
       "      <td>0.121713</td>\n",
       "      <td>0.107988</td>\n",
       "      <td>-0.053420</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>最高</th>\n",
       "      <td>0.994168</td>\n",
       "      <td>0.995569</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.995075</td>\n",
       "      <td>-0.028074</td>\n",
       "      <td>0.345375</td>\n",
       "      <td>0.527361</td>\n",
       "      <td>0.059532</td>\n",
       "      <td>0.042311</td>\n",
       "      <td>-0.027562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>最低</th>\n",
       "      <td>0.995600</td>\n",
       "      <td>0.994422</td>\n",
       "      <td>0.995075</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.085634</td>\n",
       "      <td>0.286161</td>\n",
       "      <td>0.443406</td>\n",
       "      <td>0.039373</td>\n",
       "      <td>0.026148</td>\n",
       "      <td>-0.085168</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>成交量</th>\n",
       "      <td>-0.062605</td>\n",
       "      <td>-0.053926</td>\n",
       "      <td>-0.028074</td>\n",
       "      <td>-0.085634</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.921587</td>\n",
       "      <td>0.530023</td>\n",
       "      <td>0.104503</td>\n",
       "      <td>0.061271</td>\n",
       "      <td>0.999993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>成交额</th>\n",
       "      <td>0.309691</td>\n",
       "      <td>0.316265</td>\n",
       "      <td>0.345375</td>\n",
       "      <td>0.286161</td>\n",
       "      <td>0.921587</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.704641</td>\n",
       "      <td>0.109323</td>\n",
       "      <td>0.061693</td>\n",
       "      <td>0.921802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>振幅</th>\n",
       "      <td>0.466374</td>\n",
       "      <td>0.500237</td>\n",
       "      <td>0.527361</td>\n",
       "      <td>0.443406</td>\n",
       "      <td>0.530023</td>\n",
       "      <td>0.704641</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.289007</td>\n",
       "      <td>0.235551</td>\n",
       "      <td>0.530612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>涨跌幅</th>\n",
       "      <td>-0.025143</td>\n",
       "      <td>0.121713</td>\n",
       "      <td>0.059532</td>\n",
       "      <td>0.039373</td>\n",
       "      <td>0.104503</td>\n",
       "      <td>0.109323</td>\n",
       "      <td>0.289007</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.979940</td>\n",
       "      <td>0.104511</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>涨跌额</th>\n",
       "      <td>-0.041680</td>\n",
       "      <td>0.107988</td>\n",
       "      <td>0.042311</td>\n",
       "      <td>0.026148</td>\n",
       "      <td>0.061271</td>\n",
       "      <td>0.061693</td>\n",
       "      <td>0.235551</td>\n",
       "      <td>0.979940</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.061297</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>换手率</th>\n",
       "      <td>-0.062120</td>\n",
       "      <td>-0.053420</td>\n",
       "      <td>-0.027562</td>\n",
       "      <td>-0.085168</td>\n",
       "      <td>0.999993</td>\n",
       "      <td>0.921802</td>\n",
       "      <td>0.530612</td>\n",
       "      <td>0.104511</td>\n",
       "      <td>0.061297</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           开盘        收盘        最高        最低       成交量       成交额        振幅  \\\n",
       "开盘   1.000000  0.987127  0.994168  0.995600 -0.062605  0.309691  0.466374   \n",
       "收盘   0.987127  1.000000  0.995569  0.994422 -0.053926  0.316265  0.500237   \n",
       "最高   0.994168  0.995569  1.000000  0.995075 -0.028074  0.345375  0.527361   \n",
       "最低   0.995600  0.994422  0.995075  1.000000 -0.085634  0.286161  0.443406   \n",
       "成交量 -0.062605 -0.053926 -0.028074 -0.085634  1.000000  0.921587  0.530023   \n",
       "成交额  0.309691  0.316265  0.345375  0.286161  0.921587  1.000000  0.704641   \n",
       "振幅   0.466374  0.500237  0.527361  0.443406  0.530023  0.704641  1.000000   \n",
       "涨跌幅 -0.025143  0.121713  0.059532  0.039373  0.104503  0.109323  0.289007   \n",
       "涨跌额 -0.041680  0.107988  0.042311  0.026148  0.061271  0.061693  0.235551   \n",
       "换手率 -0.062120 -0.053420 -0.027562 -0.085168  0.999993  0.921802  0.530612   \n",
       "\n",
       "          涨跌幅       涨跌额       换手率  \n",
       "开盘  -0.025143 -0.041680 -0.062120  \n",
       "收盘   0.121713  0.107988 -0.053420  \n",
       "最高   0.059532  0.042311 -0.027562  \n",
       "最低   0.039373  0.026148 -0.085168  \n",
       "成交量  0.104503  0.061271  0.999993  \n",
       "成交额  0.109323  0.061693  0.921802  \n",
       "振幅   0.289007  0.235551  0.530612  \n",
       "涨跌幅  1.000000  0.979940  0.104511  \n",
       "涨跌额  0.979940  1.000000  0.061297  \n",
       "换手率  0.104511  0.061297  1.000000  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2162277",
   "metadata": {},
   "source": [
    "* 标准方差用于检测无用信息。当某一列的方差接近0.0时，这一列的所有值几乎时相同的，也就是提供了无用信息。这样的列可以被删除。我们的例子中，方差都远大于0，可以认为没有无用信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "34d2e227",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\ADMINI~1\\AppData\\Local\\Temp/ipykernel_7032/905067297.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.\n",
      "  df_train.std()\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "开盘     5.994146e+00\n",
       "收盘     6.003877e+00\n",
       "最高     6.307932e+00\n",
       "最低     5.695554e+00\n",
       "成交量    6.820901e+05\n",
       "成交额    2.159678e+09\n",
       "振幅     2.011476e+00\n",
       "涨跌幅    2.977870e+00\n",
       "涨跌额    1.011197e+00\n",
       "换手率    8.032729e-01\n",
       "dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.std()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8300a75f",
   "metadata": {},
   "source": [
    "# **非数值变量的处理**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3804d4e",
   "metadata": {},
   "source": [
    "字符串类型，我们需要将其转化为数值变量。以下的代码中，我们将用0取代和用1取代。"
   ]
  },
  {
   "cell_type": "raw",
   "id": "9d0d7d6b",
   "metadata": {},
   "source": [
    "# Forexample:\n",
    "###  Convert ‘Sex’ feature into numeric.\n",
    "genders = {\"male\": 0, \"female\": 1}\n",
    "all_data = [df_train]\n",
    "\n",
    "for dataset in all_data:\n",
    "    dataset['Sex'] = dataset['Sex'].map(genders)\n",
    "df_train['Sex'].value_counts()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "deep",
   "language": "python",
   "name": "deep"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
